Implement conversion from LaTeX to our Markup XML #4787

Merged: 14 commits, Jun 1, 2025

@mbollmann mbollmann commented Mar 5, 2025

This reimplements functionality from bin/latex_to_unicode.py within the new library, needed for #4766.

Work in progress. Works in principle, but needs many more test cases to ensure feature parity with the previous implementation. Also, some normalization steps (as done in the old latex_to_unicode() function) are not yet ported.

  • bin/latex_to_unicode.py implements some heuristics to determine whether characters such as % or ~ are LaTeX symbols or plain text. We should add that somehow, maybe as a parameter to the conversion functions?
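
To make the open question about `%` and `~` concrete, here is a minimal sketch of what a parameter-controlled heuristic could look like. It is not the logic from bin/latex_to_unicode.py: the function name `resolve_specials`, the `heuristics` parameter, and the two rules (a `%` directly after a digit is a literal percent sign; a `~` inside a URL-like token is a literal tilde) are all assumptions made up for illustration.

```python
NBSP = "\u00a0"  # non-breaking space, the usual reading of "~" in LaTeX


def _in_url(text: str, pos: int) -> bool:
    """Return True if the whitespace-delimited token around `pos` looks like a URL."""
    start = max(text.rfind(" ", 0, pos), text.rfind("\n", 0, pos)) + 1
    return text.startswith(("http://", "https://", "www."), start)


def resolve_specials(text: str, *, heuristics: bool = True) -> str:
    """Resolve unescaped '%' and '~' (hypothetical helper, not the PR's API).

    heuristics=False: strict LaTeX reading ('%' starts a comment that runs
    through the end of the line, '~' is a non-breaking space).
    heuristics=True: additionally guess plain-text usage with two assumed rules.
    """
    result = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "\\" and i + 1 < n:
            # "\%" is always a literal percent; other escapes are copied through.
            result.append("%" if text[i + 1] == "%" else text[i : i + 2])
            i += 2
            continue
        if ch == "%":
            if heuristics and i > 0 and text[i - 1].isdigit():
                result.append("%")  # assumed rule: "50%" is plain text
                i += 1
                continue
            # LaTeX comment: skip through the end of the line
            newline = text.find("\n", i)
            i = n if newline == -1 else newline + 1
            continue
        if ch == "~":
            keep_tilde = heuristics and _in_url(text, i)
            result.append("~" if keep_tilde else NBSP)
            i += 1
            continue
        result.append(ch)
        i += 1
    return "".join(result)
```

With this sketch, `resolve_specials("a 50% gain")` keeps the percent sign, while `resolve_specials("a 50% gain", heuristics=False)` treats the rest of the line as a comment. Exposing the choice as a keyword argument is one option; the real conversion functions may well settle on a different interface.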

This reimplements most of `bin/latex_to_unicode.py` within the new library.
More tests are needed, and some conversions done in `latex_to_unicode` are still missing.
@mbollmann mbollmann added the python-library Concerning the acl-anthology-py library label Mar 5, 2025
@mbollmann mbollmann self-assigned this Mar 5, 2025

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 99.13043% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.67%. Comparing base (f474131) to head (ecaa270).
Report is 19 commits behind head on python-dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/acl_anthology/utils/latex.py | 98.63% | 1 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##           python-dev    #4787      +/-   ##
==============================================
+ Coverage       93.49%   93.67%   +0.17%     
==============================================
  Files              35       35              
  Lines            2675     2782     +107     
==============================================
+ Hits             2501     2606     +105     
- Misses            174      176       +2     

| Files with missing lines | Coverage Δ |
|---|---|
| python/acl_anthology/exceptions.py | 89.47% <ø> (ø) |
| python/acl_anthology/text/markuptext.py | 94.73% <100.00%> (-0.27%) ⬇️ |
| python/acl_anthology/utils/__init__.py | 100.00% <100.00%> (ø) |
| python/acl_anthology/utils/text.py | 96.77% <100.00%> (+3.91%) ⬆️ |
| python/acl_anthology/utils/xml.py | 98.57% <100.00%> (+0.18%) ⬆️ |
| python/acl_anthology/utils/latex.py | 99.34% <98.63%> (-0.66%) ⬇️ |

... and 7 files with indirect coverage changes

@mbollmann

@davidweichiang If you have a minute, I’d appreciate it if you could take a look at this. I have tried to port the logic of our bin/latex_to_unicode.py, which I think you mainly authored, to the new library, relying on pylatexenc rather than custom parsing + latexcodec. I created several test cases to check that the functionality is as expected. Could you look through them and see whether you can think of anything else that is important or tricky to cover when ingesting LaTeX and converting it to our XML format?

The test cases are here: https://github.com/acl-org/acl-anthology/pull/4787/files#diff-e559d67d054b0d61eb1f86a702d5373d2ea14dc6e1ff04aee432e7bcc6e912b3
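
For readers without the diff at hand, the kind of input/output pairs such tests cover can be sketched as follows. This is a standard-library toy, not the code in this PR (which relies on pylatexenc): the function `latex_to_xml`, the command-to-tag mapping, and the tag names `<b>`, `<i>`, and `<tex-math>` are assumptions chosen for illustration only.

```python
from xml.sax.saxutils import escape

# Assumed mapping from LaTeX commands to XML tag names (illustrative only).
COMMAND_TO_TAG = {"textbf": "b", "emph": "i", "textit": "i"}


def _convert(src: str, pos: int, stop):
    """Convert src[pos:] up to the `stop` character; return (xml, next_pos)."""
    parts = []
    while pos < len(src):
        ch = src[pos]
        if ch == stop:
            return "".join(parts), pos + 1
        if ch == "\\":
            # Read the command name (letters only).
            end = pos + 1
            while end < len(src) and src[end].isalpha():
                end += 1
            tag = COMMAND_TO_TAG.get(src[pos + 1 : end])
            if tag and end < len(src) and src[end] == "{":
                inner, pos = _convert(src, end + 1, stop="}")
                parts.append(f"<{tag}>{inner}</{tag}>")
            else:
                # Unknown command: keep it verbatim.
                parts.append(escape(src[pos:end]))
                pos = end
            continue
        if ch == "{":
            # Bare group: convert the contents and drop the braces.
            inner, pos = _convert(src, pos + 1, stop="}")
            parts.append(inner)
            continue
        if ch == "$":
            end = src.find("$", pos + 1)
            if end != -1:
                # Inline math is kept as LaTeX, XML-escaped, inside one element.
                parts.append(f"<tex-math>{escape(src[pos + 1 : end])}</tex-math>")
                pos = end + 1
                continue
        parts.append(escape(ch))
        pos += 1
    return "".join(parts), pos


def latex_to_xml(src: str) -> str:
    """Toy converter for a handful of constructs (hypothetical helper)."""
    xml, _ = _convert(src, 0, stop=None)
    return xml


# Table-driven cases: (LaTeX input, expected XML fragment).
CASES = [
    (r"\textbf{bold} text", "<b>bold</b> text"),
    (r"\emph{a \textbf{b}}", "<i>a <b>b</b></i>"),
    (r"\textbf{a {b} c}", "<b>a b c</b>"),
    (r"$x < y$", "<tex-math>x &lt; y</tex-math>"),
    ("AT&T", "AT&amp;T"),
]

for latex, expected in CASES:
    assert latex_to_xml(latex) == expected, (latex, latex_to_xml(latex))
```

Nested commands, bare brace groups, math, and characters that need XML escaping are the categories illustrated here; the linked test file is the authoritative list of what the PR actually covers.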

@mbollmann mbollmann marked this pull request as ready for review June 1, 2025 10:07
@mbollmann mbollmann merged commit 154dfd2 into python-dev Jun 1, 2025
14 checks passed
@mbollmann mbollmann deleted the python-normalize-and-latex-import branch June 1, 2025 10:12